vit-l 14
- Asia > China > Hong Kong (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.34)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)
Embedding-Aware Quantum-Classical SVMs for Scalable Quantum Machine Learning
Ordóñez, Sebastián Andrés Cajas, Torres, Luis Fernando Torres, Bifulco, Mario, Durán, Carlos Andrés, Bosch, Cristian, Carbajo, Ricardo Simón
Quantum Support Vector Machines face scalability challenges due to high-dimensional quantum states and hardware limitations. We propose an embedding-aware quantum-classical pipeline combining class-balanced k-means distillation with pretrained Vision Transformer embeddings. Our key finding: ViT embeddings uniquely enable quantum advantage, achieving up to 8.02% accuracy improvements over classical SVMs on Fashion-MNIST and 4.42% on MNIST, while CNN features show performance degradation. Using 16-qubit tensor network simulation via cuTensorNet, we provide the first systematic evidence that quantum kernel advantage depends critically on embedding choice, revealing fundamental synergy between transformer attention and quantum feature spaces. This provides a practical pathway for scalable quantum machine learning that leverages modern neural architectures.
- South America > Colombia > Cauca Department > Popayán (0.04)
- South America > Colombia > Antioquia Department > Medellín (0.04)
- Europe > Italy > Piedmont > Turin Province > Turin (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Information Technology (0.68)
- Health & Medicine > Therapeutic Area (0.46)
- Asia > China > Hong Kong (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.34)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)
- Europe > France (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- Research Report > Experimental Study (0.92)
- Overview (0.67)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Overview (0.67)
On the Reproducibility of "FairCLIP: Harnessing Fairness in Vision-Language Learning''
Bakker, Hua Chang, Fris, Stan, Bernardy, Angela Madelon, Deutekom, Stan
We investigated the reproducibility of FairCLIP, proposed by Luo et al. (2024), for improving the group fairness of CLIP (Radford et al., 2021) by minimizing image-text similarity score disparities across sensitive groups using the Sinkhorn distance. The experimental setup of Luo et al. (2024) was reproduced to primarily investigate the research findings for FairCLIP. The model description by Luo et al. (2024) was found to differ from the original implementation. Therefore, a new implementation, A-FairCLIP, is introduced to examine specific design choices. Furthermore, FairCLIP+ is proposed to extend the FairCLIP objective to include multiple attributes. Additionally, the impact of the distance minimization on FairCLIP's fairness and performance was explored. In alignment with the original authors, CLIP was found to be biased towards certain demographics when applied to zero-shot glaucoma classification using medical scans and clinical notes from the Harvard-FairVLMed dataset. However, the experimental results on two datasets do not support their claim that FairCLIP improves the performance and fairness of CLIP. Although the regularization objective reduces Sinkhorn distances, both the official implementation and the aligned implementation, A-FairCLIP, were not found to improve performance nor fairness in zero-shot glaucoma classification.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (0.69)
- Education > Curriculum > Subject-Specific Education (0.50)
- Law (1.00)
- Information Technology > Security & Privacy (0.93)
On the Domain Robustness of Contrastive Vision-Language Models
Koddenbrock, Mario, Hoffmann, Rudolf, Brodmann, David, Rodner, Erik
In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States (0.04)
- Europe > Germany > Berlin (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area (0.47)
- Health & Medicine > Diagnostic Medicine > Imaging (0.46)
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
Farina, Matteo, Mancini, Massimiliano, Iacca, Giovanni, Ricci, Elisa
An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop. When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-tuning (PEFT) and FSA. In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the ``base'' classes. We show that such dynamics naturally splits into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts. To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently. Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). Differently from established methods, our scheme enables a novel form of selective inference at a category level, i.e., at test time, only novel categories are embedded by the adapted text encoder, while embeddings of base categories are available within the classifier. Results with fixed hyperparameters across two settings, three backbones, and eleven datasets, show that 2SFS matches or surpasses the state-of-the-art, while established methods degrade significantly across settings.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.45)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)